SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
TSwD: Telling Stories with Data (Alexander)
ROS: Regression and Other Stories (Gelman, Hill, and Vehtari)
Count data are observations that can only take non-negative integer values: 0, 1, 2, 3, …
Examples in social science:
Linear regression assumes:
Problems with count data:
Key Issue
Using linear regression for count data can produce impossible predictions (like -3.5 crimes) and incorrect standard errors.
The Poisson distribution is designed for count data. It is governed by a single parameter, λ (lambda), which represents both the mean and variance.
\[P_\lambda(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \text{ for } k = 0, 1, 2, ...\]
Key property: In the Poisson distribution, the mean equals the variance.
\[E(Y) = Var(Y) = \lambda\]
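Draws like those shown below can be generated with rpois(); the seed and the choice of λ = 2 here are assumptions:

```r
# Twenty draws from a Poisson distribution with lambda = 2.
# With more draws, mean(draws) and var(draws) both approach lambda.
set.seed(853)
rpois(n = 20, lambda = 2)
```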
[1] 2 1 3 2 0 2 1 2 1 1 6 4 1 1 2 0 2 2 2 1
In Poisson regression, we model the expected count as a function of predictors:
\[y_i \sim \text{Poisson}(e^{X_i\beta})\]
Or equivalently:
\[\log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...\]
The Log Link Function
We use the logarithm as the link function because:
Because we use a log link, coefficients are interpreted as multiplicative effects on the expected count.
For a coefficient β:
Quick Approximation
For small coefficients (|β| < 0.1), the coefficient approximately equals the percentage change. For example, β = 0.05 ≈ 5% increase.
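The exact multiplicative effect is exp(β), so the quality of this approximation can be checked directly:

```r
# For small coefficients, exp(beta) - 1 is close to beta itself,
# so beta can be read as an approximate proportional change.
beta_small <- 0.05
exp(beta_small) - 1  # about 0.051, i.e. roughly a 5% increase

# The approximation breaks down for larger coefficients:
beta_large <- 0.7
exp(beta_large) - 1  # about 1.01: the expected count roughly
                     # doubles, not a 70% increase
```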
library(tidyverse)

# Simulate counts whose log-mean is linear in x
set.seed(853)
n <- 100
x <- runif(n, -2, 2)
a <- 1
b <- 0.5
linpred <- a + b * x
y <- rpois(n, exp(linpred))
sim_data <- tibble(x = x, y = y)

# Plot the counts with the true curve exp(a + b * x) overlaid
ggplot(sim_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.6, size = 3) +
  stat_function(
    fun = function(x) exp(a + b * x),
    colour = "red", linewidth = 1
  ) +
  labs(
    x = "Predictor (x)",
    y = "Count (y)",
    title = "Simulated Poisson Data"
  ) +
  theme_minimal(base_size = 16)

Using glm() with family = poisson:
Call:
glm(formula = y ~ x, family = poisson(link = "log"), data = sim_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.98083 0.06392 15.345 <2e-16 ***
x 0.51677 0.05599 9.229 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176.535 on 99 degrees of freedom
Residual deviance: 86.251 on 98 degrees of freedom
AIC: 343.93
Number of Fisher Scoring iterations: 5
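To read these estimates on the count scale, exponentiate them; a sketch, assuming the model is refit and stored in an object here called `fit`:

```r
# Refit the model from above and exponentiate the coefficients.
fit <- glm(y ~ x, family = poisson(link = "log"), data = sim_data)
exp(coef(fit))
# exp(0.52) is about 1.68: each one-unit increase in x multiplies
# the expected count by roughly 1.68 (a 68% increase).
```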
Let’s simulate data about the number of A grades awarded in university courses across three departments:
set.seed(853)
class_size <- 26
count_of_A <- tibble(
department = c(rep("1", 26), rep("2", 26), rep("3", 26)),
course = c(paste0("DEP_1_", letters),
paste0("DEP_2_", letters),
paste0("DEP_3_", letters)),
number_of_As = c(
rpois(n = class_size, lambda = 5),
rpois(n = class_size, lambda = 10),
rpois(n = class_size, lambda = 20)
)
)

stan_glm
family: poisson [log]
formula: number_of_As ~ department
observations: 78
predictors: 3
------
Median MAD_SD
(Intercept) 1.3 0.1
department2 0.9 0.1
department3 1.7 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
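The summary above corresponds to a call along these lines (default rstanarm priors; the seed is an assumption):

```r
library(rstanarm)

# Poisson regression of the number of As on department,
# using the simulated count_of_A data from above.
grades_poisson <- stan_glm(
  number_of_As ~ department,
  family = poisson(link = "log"),
  data = count_of_A,
  seed = 853,
  refresh = 0
)
grades_poisson
```

Because the coefficients are on the log scale, a department2 coefficient of 0.9 means department 2 awards exp(0.9) ≈ 2.5 times as many As as the baseline department.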
Overdispersion occurs when the observed variance in the data exceeds what the model predicts.
For Poisson regression:
Why Does This Matter?
Overdispersion leads to:
Compare the mean and variance of your count variable:
# A tibble: 1 × 3
mean variance ratio
<dbl> <dbl> <dbl>
1 11.2 62.3 5.57
A ratio much greater than 1 suggests overdispersion.
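The table above can be produced along these lines (shown here on the simulated grades data, which is an assumption about its source):

```r
library(dplyr)

# Pooling the three departments together, group differences inflate
# the variance well past the mean, a symptom of overdispersion.
count_of_A |>
  summarise(
    mean     = mean(number_of_As),
    variance = var(number_of_As),
    ratio    = variance / mean
  )
```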
Unobserved heterogeneity: Important variables are missing from the model
Clustering: Observations within groups are correlated
Excess zeros: More zeros than the Poisson distribution predicts
Outliers: A few extreme values
Wrong distributional assumption: The data-generating process isn’t Poisson
The negative binomial distribution adds an extra parameter to allow for overdispersion:
\[Var(Y) = E(Y) + \frac{E(Y)^2}{\phi}\]
where φ (phi) is the “reciprocal dispersion” parameter.
Flexibility
The negative binomial allows the variance to be larger than the mean, which better fits most real-world count data.
Using rstanarm:
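A sketch of the call, with freshly simulated overdispersed data; the data and parameter values here are assumptions, so the estimates will not match the printout below exactly:

```r
library(rstanarm)

# Simulate overdispersed counts: negative binomial draws, so
# Var(y) exceeds E(y) at every value of x.
set.seed(853)
n <- 100
x <- runif(n, -2, 2)
y <- rnbinom(n, size = 2, mu = exp(1 + 0.5 * x))
overdispersed_data <- data.frame(x = x, y = y)

# Negative binomial regression; the extra reciprocal_dispersion
# parameter absorbs variance beyond the Poisson mean = variance rule.
nb_fit <- stan_glm(
  y ~ x,
  family = neg_binomial_2(link = "log"),
  data = overdispersed_data,
  seed = 853,
  refresh = 0
)
nb_fit
```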
stan_glm
family: neg_binomial_2 [log]
formula: y ~ x
observations: 100
predictors: 2
------
Median MAD_SD
(Intercept) 0.9 0.1
x 0.5 0.1
Auxiliary parameter(s):
Median MAD_SD
reciprocal_dispersion 2.7 0.8
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
| Term | Poisson | Negative Binomial |
|---|---|---|
| (Intercept) | 0.94 (0.07) | 0.94 (0.09) |
| x | 0.46 (0.06) | 0.46 (0.08) |
Real-world example from the textbook: examining leading causes of death in Alberta.
| cause | Min | Mean | Max | SD | Var |
|---|---|---|---|---|---|
| Heart disease | 680 | 1046.1905 | 1427 | 201.5963 | 40641.06 |
| Dementia | 897 | 1791.0952 | 2673 | 415.0024 | 172226.99 |
| Lung cancer | 1120 | 1510.5238 | 2057 | 242.0852 | 58605.26 |
| Chronic lower respiratory | 561 | 917.5238 | 1120 | 153.1723 | 23461.76 |
| Stroke | 457 | 603.0000 | 855 | 117.9178 | 13904.60 |
Use Poisson when:
Use Negative Binomial when:
Rule of Thumb
When in doubt, start with negative binomial. It’s safer and reduces to Poisson when overdispersion is minimal.
Sometimes counts occur over different exposure periods or populations.
Examples:
To model rates rather than raw counts, we use an offset:
\[y_i \sim \text{Poisson}(u_i \cdot \theta_i)\]
where \(u_i\) is the exposure and \(\theta_i = e^{X_i\beta}\) is the rate.
Taking logs, with \(\lambda_i = u_i \theta_i\):
\[\log(\lambda_i) = \log(u_i) + X_i\beta\]
The log of the exposure, \(\log(u_i)\), is called the offset.
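A minimal sketch of an offset in practice; the data here are hypothetical:

```r
# Hypothetical data: event counts observed over varying exposures.
study_data <- data.frame(
  events   = c(3, 10, 1, 8),
  exposure = c(10, 40, 5, 30),   # e.g. person-years observed
  group    = c(0, 1, 0, 1)
)

# offset(log(exposure)) enters with a fixed coefficient of 1, so the
# model describes the rate events / exposure, not the raw count.
rate_fit <- glm(
  events ~ group + offset(log(exposure)),
  family = poisson(link = "log"),
  data = study_data
)
coef(rate_fit)
```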
From ROS Chapter 15: A study of pest management in apartments.
So far, we’ve treated all observations as independent (complete pooling).
But data often has hierarchical structure:
Why Does This Matter?
Ignoring grouping structure can lead to:
Complete Pooling
Treat all observations as one group.
Problem: Ignores group differences.
No Pooling
Fit separate models for each group.
Problem: Ignores information shared across groups.
Partial Pooling
Allow groups to differ while “borrowing strength” from other groups.
Multilevel modelling!
Multilevel models (also called hierarchical models, random effects models) allow:
Basic model with varying intercepts:
\[y_{ij} = \beta_0 + \alpha_j + \beta_1 x_{ij} + \epsilon_{ij}\]
where \(\alpha_j \sim N(0, \sigma^2_\alpha)\) is the random effect for group \(j\).
The (1 | state) term specifies varying intercepts by state.
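The output below can be produced by a call along these lines; the simulated data here are an assumption, chosen to echo the printed estimates:

```r
library(rstanarm)

# Hypothetical survey: 500 respondents spread across 10 states,
# with true parameters matching the estimates shown below.
set.seed(853)
survey_data <- data.frame(
  state  = sample(paste0("state_", 1:10), 500, replace = TRUE),
  gender = rbinom(500, 1, 0.5)
)
state_effect <- rnorm(10, mean = 0, sd = 1)
names(state_effect) <- paste0("state_", 1:10)
survey_data$supports <- rbinom(
  500, 1,
  plogis(-1.4 + 0.8 * survey_data$gender +
           state_effect[survey_data$state])
)

# Varying-intercept logistic regression: each state gets its own
# intercept, drawn from a common normal distribution.
ml_fit <- stan_glmer(
  supports ~ gender + (1 | state),
  family = binomial(link = "logit"),
  data = survey_data,
  seed = 853,
  refresh = 0
)
ml_fit
```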
stan_glmer
family: binomial [logit]
formula: supports ~ gender + (1 | state)
observations: 500
------
Median MAD_SD
(Intercept) -1.4 0.3
gender 0.8 0.2
Error terms:
Groups Name Std.Dev.
state (Intercept) 1
Num. levels: state 10
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Consider multilevel modelling when:
Practical Advice
If you have groups with at least 5-10 observations each and want to account for group-level variation, consider multilevel modelling.
All the models we’ve covered fit within the Generalised Linear Model framework:
| Distribution | Link | Inverse Link | Use Case |
|---|---|---|---|
| Normal | Identity | \(\mu = X\beta\) | Continuous data |
| Binomial | Logit | \(\mu = \text{logit}^{-1}(X\beta)\) | Binary outcomes |
| Poisson | Log | \(\mu = e^{X\beta}\) | Count data |
| Neg. Binomial | Log | \(\mu = e^{X\beta}\) | Overdispersed counts |
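In R, each GLM family object carries its link and inverse link functions, which you can inspect directly:

```r
# A family object bundles the link and inverse link from the table.
fam <- poisson(link = "log")
fam$linkfun(10)   # the link: log(10), about 2.30
fam$linkinv(2.3)  # the inverse link: exp(2.3), about 9.97

binomial()$linkinv(0)  # inverse logit of 0 is 0.5
```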
| Outcome Type | Model | R Function |
|---|---|---|
| Continuous | Linear regression | lm() or stan_glm() |
| Binary (0/1) | Logistic regression | glm(family=binomial) or stan_glm() |
| Count (no upper limit) | Poisson regression | glm(family=poisson) or stan_glm() |
| Count (overdispersed) | Negative binomial | MASS::glm.nb() or stan_glm(family=neg_binomial_2) |
| Count (with exposure) | Poisson/NB with offset | Add offset=log(exposure) |
Count data requires special models (Poisson or negative binomial)
Overdispersion is common; the negative binomial handles it well
Offset terms allow modelling rates when exposure varies
Coefficients are multiplicative (exponentiate to interpret)
Multilevel models account for hierarchical data structure
Remember
Always check model assumptions and use posterior predictive checks to validate your model choice!
# Poisson regression
glm(y ~ x, family = poisson(link = "log"), data = df)
stan_glm(y ~ x, family = poisson(link = "log"), data = df)
# Negative binomial regression
MASS::glm.nb(y ~ x, data = df)
stan_glm(y ~ x, family = neg_binomial_2(link = "log"), data = df)
# With offset for rates
stan_glm(y ~ x, family = poisson, offset = log(exposure), data = df)
# Multilevel model with varying intercepts
stan_glmer(y ~ x + (1 | group), family = poisson, data = df)

Week 11: Surveys and Experimental Design
Readings: